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Chapter 1 

Introduction 


This document is an adjunct to the final report An Integrated Safety Analysis 
Methodology for Emerging Air Transport Technologies. That report presents the 
results of our analysis of the problem of simultaneous but independent, ap- 
proaches of two aircraft on parallel runways (independent approaches on parallel 
runways, or IAPR). This introductory chapter presents a brief overview and per- 
spective of approaches and methodologies for performing safety analyses for 
complex systems. Ensuing chapters provide the technical details that underlie the 
approach that we have taken in performing the safety analysis for the IAPR con- 
cept. 

Safety Analysis: Perspectives 

A thorough analysis of system safety must address the problem from a variety of 
perspectives — each impacting safety in a different way. One perspective relates to 
the operational environments or operational scenarios within which the system is 
expected to function. Those environments or scenarios that, by their nature, pro- 
vide opportunities for unsafe operating conditions have an impact on system 
safety and approaches to modeling and understanding those impacts must be de- 
veloped. Another perspective relates to the reliability and availability of the func- 
tions performed by the hardware, software, and human components of the system. 
Failures or degradation in the performance of elements of safety -critical system 
components have an impact on safety, and models of the reliability of those ele- 
ments must be developed in order to determine the impact on system safety. Fi- 
nally, the rules and procedures under which a system operates can have a 
significant impact on the system safety, and approaches must be developed to 
analyze the impact of those rules and procedures on safety for all modes of opera- 
tion of the system. 

The three perspectives are illustrated in Figure 1-1 and can be simply summarized 
as follows: 

♦ System element functionality . This entails an analysis of how well and re- 
liably system elements work and the attendant impact on safety. 

♦ Rules and procedures. This involves analyses of how the system rules and 
procedures have been designed to respond in both safe and unsafe situa- 
tions. 
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♦ Operational scenario. This involves analysis of the environment in which 
the system is expected to operate and its attendant impact on system 
safety. 

Figure 1-1. Perspectives of a System-Level Safety Analysis 



Approaches to Evaluating System Safety 

A variety of approaches have been developed to address the three perspectives of 
system safety analysis described in the preceding section. Some of those ap- 
proaches are outlined below. 

♦ Statistical analysis of existing systems (descriptive approaches). 

These approaches are based on statistical analyses of data that are col- 
lected over long periods of time. The work of Professor Arnold Barnett of 
Massachusetts Institute Technology (MIT) is an example of this kind of ef- 
fort. This class of approaches to safety analysis is “after the fact” and is 
useful in identifying shortcomings in existing systems but has limited util- 
ity in predicting the safety consequences of proposed system concepts. 
Since our interest is in evaluation and analysis of the safety of new system 
concepts, we will not dwell on these approaches. 

♦ Analysis of candidate designs that model human, technical (hardware and 
software), and procedural aspects of the system (predictive approaches). 

► “ility” analytical modeling. 

Markov, semi-Markov, combinatorial, and fault tree models are used 
to determine system reliability, availability, maintainability, etc. These 
approaches have matured over time, and the Markov reliability mod- 
eling approach is the one that we have chosen and that is elaborated 
upon in succeeding chapters. Further discussion is included in Chap- 
ter 2. 
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► Simulation. 

A variety of statistical event simulation approaches including discrete 
event simulations, importance sampled Monte Carlo simulations, and 
hybrid simulations with both human operators and hardware in the 
loop have been used to predict the safety of proposed system concepts. 
The advantage of simulations is that they are typically easier to design 
and implement than the analytical models described in (a) above. The 
disadvantage is that, in order to obtain statistically significant results 
for very low probability events, many simulations must be performed. 
The approach that we have taken in the IAPR study is to use a Markov 
model to determine the probabilities of being in potentially unsafe 
system states and to employ a deterministic simulation of the system 
operating in those states. The system safety metrics generated by the 
simulations are then weighted by the Markov state probabilities to ob- 
tain the total expected values of those metrics. Further discussion of 
simulation approaches is included in Chapter 2. 

► Human performance modeling. 

The development of models to predict the effect of workload and task 
design on human error rates and human response times was considered 
beyond the scope of the effort for this task. Indeed, good models of the 
human are critical to the complete analysis of a system. Due to con- 
straints on time and budget, we chose to use simple models of human 
performance and to embed those in our system Markov models. Thus, 
at the level of failure and performance degradation, the function of a 
person is characterized no differently than that of other system compo- 
nents. 

► Formal methods. 

These are mathematical, logic-based approaches for specifying and 
implementing safety-critical hardware and software systems and for 
verifying correctness and completeness of their design and implemen- 
tation. Typical of these approaches are those taken by professor Nancy 
Lynch at MIT and professor Nancy Leveson at the University of Ore- 
gon. These approaches are best applied when the system is defined at a 
higher level of detail than the IAPR concept that we are investigating. 

► Information security. 

For safety -critical information exchanges, for example Automatic De- 
pendent Surveillance-Broadcast (ADS-B) for local air traffic status 
between aircraft when air separation responsibility is transferred to pi- 
lots, the security and integrity of the exchanged information are clearly 
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critical. One way to view information security is in terms of protecting 
the computers and communications assets of the system. There are 
several protection mechanisms: protection against unauthorized altera- 
tions of the data and protection against denial of exchange of data. To 
date, little has been done in the development of models of information 
security and its impact on safety-critical functions. This is an area for 
research and is not addressed further here. 

Objective: A Unified Framework 
for Integrated Safety Analysis 

How best can the many approaches to evaluating system safety be combined into a 
unified framework for safety analysis? Drawing the three perspectives shown in 
Figure 1-1 into a unifying framework that directly addresses the interactions and 
coupling among those perspectives, our first step toward such a unification is de- 
scribed below and has been applied in evaluating the IAPR concept. As time and 
experience in applying this unified approach evolve, we anticipate further refine- 
ments will be made. 

The integrated safety analysis that we employ is distinguished by its ability to 
merge system design or functionality information with a parameterization of a 
system’s situation. This is illustrated in Figure 1-2. The “system” may include 
both air and ground subsystems within this analysis framework. 

In Figure 1-1, we see that system safety is being addressed from a variety of per- 
spectives, each of which impacts safety. These include 

♦ system functionality, the analysis of how reliably the system components 
perform; 

♦ rules and procedures, the analysis of how the system is designed to re- 
spond in both safe and unsafe situations; and 

♦ operational scenario, the analysis of the environment in which the system 
is expected to operate. 

Integrating models that quantify each one of these three elements creates an analy- 
sis capability that is now systemwide and responsive to ongoing changes in the 
definition and requirements of the operational concept. 

The steps leading from requirements derived for an operational concept to the de- 
velopment of a reliability model of the system architecture that has been proposed 
to meet those requirements are shown on the left side of Figure 1-2. This repre- 
sents a traditional reliability/safety modeling process. On the right are the models 
required to capture the environment in which the system is to operate as well as 
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Introduction 


the interaction of those environmental models with response models that represent 
the execution of the rules and procedures that have been developed for the candi- 
date concept. This represents a modeling process for the dynamic analysis of the 
system’s situation. 

Figure 1-2. Integrated Safety and Reliability Modeling and Evaluation 



Infrastructure “ ilities " Metrics 


Operations Safety Metrics 


Our approach to system safety analysis results from the integration of the Reli- 
ability model and the Interaction-Response model. The Interaction-Response 
model provides information regarding the frequency of encounters and the pre- 
dicted outcome of those encounters as a function of the system’ s alerting system 
and ability to resolve encounters. The Reliability model provides, as a function of 
time, probabilities associated with the critical systems’ availability and failure 
states. Scaling the operations safety metrics from the Interaction-Response model 
by the system state probabilities from the Reliability model creates the system- 
level safety statistics. This process is illustrated in Figure 1-3. 
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Figure 1-3. Combining Model Outputs 


Probability vector 
from Reliability Model 

[ P (1 .t). p(2,t) ...] 


Performance metrics from 

Interaction-Response Model 

• Correct rejection 

• Correct detection 

• Unnecessary alert 

• Missed detection 

• Late alert 

• Induced collision 

Products of this analysis include 

♦ predicted accident statistics, 

♦ predicted false alarm statistics, and 

♦ predicted system availability and reliability. 

Moreover, as the operational concept evolves, the impact of changes in system 
architecture, rules and procedures, and operational scenarios can be easily re- 
evaluated with this methodology. 


Conditionally scale 
metrics by 
state probabilities 


Weighted system 
safety statistics 

• Reliable operation 

• Collisions 

• False alarms 
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Chapter 2 

Overview of Analytical Reliability Modeling 
Techniques 


The analytic approaches to quantifying system reliability fall into three classes: 
Monte Carlo simulations, combinatorial reliability models (see below), and 
Markov modeling. The strengths and weaknesses of these three approaches are 
briefly described in this section. 

MOnte Carlo Simulations 

Simulation [1,2] can be used to determine reliability by generating failure and 
repair events at times distributed according to the component failure and repair 
rates. These simulations are repeated until statistically significant reliability meas- 
ures are accumulated. A major strength of the simulation approach is its ability to 
analyze very complicated repair and reconfiguration scenarios, with relatively lit- 
tle knowledge required beyond a description of the system to be analyzed. How- 
ever, a key difficulty of this method is that for highly reliable systems the failure 
rate is so low that, in order to accumulate a statistically meaningful number of 
events, a very large number of simulations must be run. While there are means 
(collectively known as variance reduction techniques) of increasing the efficiency 
of the basic method, the underlying difficulty remains a drawback. 

Combinatorial Reliability Models 

Combinatorial reliability models [3] have been widely used. Fault-tree analysis 
[4], for example, has become a standard analytical method for reliability predic- 
tion in a wide variety of applications. This analytical technique combines compo- 
nent failure probabilities, based on the system architecture and redundancy 
management approach, to determine system reliability. Since there is no explicit 
simulation of system operation, the combinatorial technique avoids the deficien- 
cies of the Monte Carlo simulation. There are, however, three limitations to this 
approach. First, the fault tree is constructed to predict the probability of the system 
being in a particular operating condition (for example, a working condition or a 
failed condition). If it is desired to investigate the probability of being in other 
conditions, such as a variety of different operating modes, then new fault trees 
have to be constructed. Second, it is difficult to include events that have order de- 
pendencies, such as repairs and explicit modeling of reconfiguration strategies. 
Even in relatively simple systems, there often are subtle sequence dependencies. 
Finally, the nature of the combinatorial analysis requires that all combinations of 
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events for the entire time period must be included. For complex systems, this re- 
sults in a complicated fault tree that is difficult to construct and validate. 

Markov Modeling 

Markov modeling techniques have been increasingly used for reliability prediction 
[3, 5, 6], These techniques have also been used successfully to aid in the design of 
fault-tolerant systems [7, 8]. A Markov reliability model calculates the probability 
of the system being in various states as a function of time. A state in the model 
represents the system status with respect to component failures and the behavior 
of the system’s redundancy management strategy. Transitions from one state to 
another occur at given transition rates, which reflect component failure and repair 
rates and redundancy management performance. Each element in the model's state 
vector represents the time-dependent probability of the system being in a specific 
state. Since the Markov model traces the evolution of state probabilities based on 
the transition rates mentioned above, it is not explicitly simulating the system and, 
therefore, does not have the deficiencies associated with the Monte Carlo tech- 
nique. The Markov model is cast into a system of differential equations. Sequence 
dependencies, such as repairs and redundancy management decisions, are in- 
cluded naturally. Furthermore, the differential nature of the model means that it is 
not necessary to generate explicitly all possible combinations of events that can 
occur over the entire time period in question; rather, it is only necessary to model 
events that can occur during an infinitesimal time step. Of course, there are also 
some drawbacks to this method. First, the state space grows exponentially with 
the number of components. However, techniques have been developed to render 
this problem tractable in many situations of interest [9]. Second, treatment of 
complex mission scenarios and repair strategies, although possible, are generally 
cumbersome. 

It should be emphasized that the reliability of a system does not depend on the 
analytical method used to evaluate it, so long as any approximations and simplifi- 
cations are consistently applied or interpreted. 
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Chapter 3 

The Modeling Process 


Role of the Reliability Model 

The objective of the reliability model is to predict the state of the aircraft capa- 
bilities at the start of and during an independent approach. In general, when an 
aircraft lines up for an independent approach, it will have been in-flight for sev- 
eral hours. Assuming that the aircraft had no failures prior to takeoff, in the time 
from takeoff until the start of the approach, failures of components within the 
systems of the aircraft may have occurred that have reduced its capabilities. The 
reduced capabilities, possibly undetected by the pilot, can affect the performance 
of the aircraft during the approach and result in the aircraft drifting or blundering 
into the path of an aircraft approaching the adjacent runway. Alternately, the com- 
ponent failures during en route flight may prevent an independent approach from 
taking place. Procedural rules may prohibit the pilot from attempting an inde- 
pendent approach if there is a known loss of a specific aircraft capability or, in the 
worst case, failures could cause the loss of the aircraft. The reliability model cal- 
culates the probabilities of the reduced capabilities impacting the safety of the air- 
craft when an independent approach is attempted. 

The actual process of generating a reliability model requires information on ar- 
chitecture, component characteristics, operational requirements, and reconfigura- 
tion procedures. The system architecture provides information such as what 
components exist and how they are connected, both physically and logically. The 
model also needs to be fed various component characteristics, such as failure and 
repair rates. The operational requirements provide a definition of what equipment 
or abilities are needed to achieve an operational state. The reconfiguration proce- 
dures are the actions taken when a failure occurs so that system operation remains 
in the most desirable mode. 

The process of generating a reliability model for a system can be divided into 
three steps. First, the system needs to be carefully examined. The goal is to dis- 
cover how the system operates and its critical elements. This step results in a sys- 
tem description. Second, the impact of failures is explored. This step is often 
called a failure modes and effects analysis (FMEA). During this step, the failure 
modes of the system are delineated. Third, the Markov model is constructed. In- 
formation on system operation from step one is used to guide modeling decisions 
such as the proper representation for the human elements. The model is a system- 
atic representation of the FMEA from step two. Each of these steps is discussed in 
a subsection below. 
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Define Aircraft Functional Elements 


The first step in developing the reliability model needed for the IAPR system 
safety model is to define the aircraft functions that directly and uniquely impact 
the inputs of the Interaction-Response model. The functions or capabilities of the 
aircraft used in the IAPR system safety model are defined in Table 3-1. These 
functions were developed by reviewing the current status of the development of 
the Airborne Information for Lateral Spacing system (AILS). However, the func- 
tion definitions and the system description of the IAPR system presented in the 
next section are not strictly based on the AILS system. The function definitions 
and the system description represent the capabilities and components, respec- 
tively, which are likely to comprise an IAPR system, since a specification of an 
AILS system does not yet exist. 

Table 3-1. IAPR System Functional Elements 


Element 

Description 

IAPR RNP 

The ability to perform conformance monitoring of an 
aircraft’s performance and adherence to its ap- 
proach path (Required Navigation Performance, or 
RNP) 

ADS-B/surveillance data link 

The ability of an aircraft to broadcast, receive, and 
process ADS-B information for situational aware- 
ness, conflict avoidance, and airspace management 

Collision-alerting avionics 

The ability of an aircraft’s system to predict a prob- 
able collision with another aircraft during approach 
and landing and to provide timely and reliable alerts 
so that the pilot can avoid the collision (this includes 
alerting logic, processing, and display monitors) 

Guidance and control 

The aggregate of all other aircraft capabilities (e.g., 
propulsion, flight control, engine control) and sup- 
port subsystems exclusive of the previous three 
functions 

Pilot 

The capability of the pilot(s) to safely operate the 
aircraft 


The function definitions are limited to the capabilities of a single aircraft. The 
LAPR system is an aircraft-based collision avoidance system, but there may be de- 
pendencies on systems external to the aircraft that can affect safety. The depend- 
encies with the aircraft that may be approaching the adjacent runway will be 
accounted for because the same function definitions are applied to the adjacent 
aircraft. The dependencies on systems exclusive of the two aircraft are not in- 
cluded in the reliability model. These would include any monitoring and interac- 
tion from the ground controller or interaction with other aircraft in the airport 
area. 
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The functions defined in Table 3-1 are the capabilities of the aircraft required for 
an independent approach. The first three functions represent capabilities that need 
to be added to present commercial aircraft to support IAPR. The fourth function, 
guidance and control, represents all the capabilities and systems of the aircraft, 
exclusive of those required for the first three functions, which can affect safety of 
an independent approach. The fifth function isolates the capability the pilot (and 
crew) provide in the safe operation of the aircraft. 

Describe the System 

The following system description defines the reliability characteristics of the 
IAPR system. That is, the system description presented defines the individual 
components that can fail, how they are interconnected, the redundancy of the 
components and subsystem functions, and the redundancy management logic. 

To demonstrate the safety analysis methodology, a low-fidelity description of a 
plausible IAPR system has been created. A design for the IAPR system presently 
does not exist. So, a system is created that provides the functionality expected for 
an IAPR system and includes some degree of fault tolerance. The system descrip- 
tion constructed is complex enough to demonstrate the application of the safety 
analysis methodology, but simple enough so that minimal resources would be 
needed to develop the reliability model. The low-fidelity model does not limit the 
approach. 

Each system component in the system description is assigned to only one function 
to maintain the independence of the functions. The advantage of maintaining the 
independence of the functions is that it enables the probability of any system state 
to be computed in a simple and direct manner. For example, the probability of the 
system being fully operational , at some time t, is simply the product of the prob- 
abilities of each of the functions being in their fully operational states at time t. 

Figures 3-1 through 3-4 present the block diagrams for the system description. 
These are discussed in the following subsections. However, to fully understand 
the block diagrams, several conventions need to be defined. 

♦ Components shown with broken lines are assigned to another function. 
They are included in the following block diagrams of some of the func- 
tions to indicate the interconnection between the components of different 
functions, and they are not considered one of the components necessary for 
the function. 

♦ Overlapping blocks indicate dual-redundant components. Dual-redundant 
components are both on-line if functional, but only one is necessary for the 
function to be fully operational. 
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♦ The connections between components shown should be understood to in- 
dicate that the connected components are fully cross-strapped. For exam- 
ple, in Figure 3-1 the connection between the “navigation processors” and 
the “navigation displays” (indicated by the arrow) means each of the two 
navigation processors is connected to each of the navigation displays. 


Figure 3-1. 1APR RNP 
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Figure 3-2. ADS-B/Surveillance Data Link 
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Figure 3-3. Collision-Alerting Avionics 
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Independent Approaches on Parallel Runways Required 
Navigational Performance 

Figure 3-1 presents the block diagram of the IAPR RNP system. The six compo- 
nents shown framed with solid lines provide the IAPR RNP function defined in 
Table 3-1. The Global Positioning System (GPS) receiver and Inertial Navigation 
System (INS) provide the sensed position of the aircraft. The GPS receiver pro- 
vides discrete position updates at fixed intervals in time. The INS data are inte- 
grated with the position updates from the GPS receiver to provide a more frequent 
position update than can be obtained with the GPS receiver alone. The data fusion 
and the navigation computation are done in the navigation processor. The naviga- 
tion displays provide the flight crews with the navigation information and alerts 
when navigation containment is violated. 

Table 3-2 presents the operational states of the IAPR RNP function that are perti- 
nent to the IAPR safety model. The IAPR RNP system is fully operational if both 
the GPS receiver and the INS 1 navigation processor and 1 navigation displays are 
functional. The system is “degraded” if either the GPS or INS has failed, the fail- 
ures are detected and compensated for, and an indication has been given to the 
pilot by the system. The “failed safe” state is the state of the system when compo- 
nent failures have caused the loss of the IAPR RNP navigation function and an 
indication is provided to the pilot to evidence that this capability no longer is 
available. Alternately, the “failed uncovered” state represents the loss of the func- 
tion, but an indication is not provided to the pilot to indicate the loss of that func- 
tion. 
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Table 3-2. IAPR RNP Navigation Operational States 


State 

Definition 

Impact 

Fully operational 

TSE (total system error) is less 
than containment limit and no 
alert of loss of RNP capability 

Navigation capability available for 
normal approach; ideal distributions 

Degraded 

Loss of either GPS or INS result- 
ing in a degraded navigation ca- 
pability 

Navigation capability available for 
normal approach; nonideal 
distributions 

Failed safe 

Alert of loss of RNP capability 

No longer able to perform independ- 
ent approaches; approach aborted 

Failed uncovered 

TSE is greater than containment 
limit and no alert of loss of RNP 
capability 

Invalid self-knowledge and 
broadcast of navigation data 


ADS-B/Surveillance Data Link 

Figure 3-2 shows the block diagram of the ADS-B/surveillance data link system. 
The ADS-B/surveillance data link system transmits the IAPR state variable data 
for the aircraft (which the aircraft performing an independent approach on the ad- 
jacent runway can monitor) and receives the IAPR state variable data from the 
adjacent aircraft. The IAPR state variable data broadcast from the aircraft enables 
the collision alerting avionics of other aircraft to predict a collision. Conversely, 
the IAPR state variable data the aircraft receives from other aircraft enables it to 
predict a collision with those aircraft. The Attitude Heading Reference System 
(AHRS), GPS receiver, and INS provide the sensor data that make up the IAPR 
state variable data. However, these three sensors provide redundant information 
and sufficient data are available if two of the three are functional. (Note that the 
GPS receiver and the INS are not included in the ADS-B/surveillance data link 
function, having already been included in the IAPR RNP navigation function.) 

For the ADS-B/surveillance data link function to be fully operational, 1 ADS-B 
processor, 1 ADS-B displays, the modulator and transmitter, the receiver and de- 
modulator, and the antenna must be functional. The degraded, failed safe, and 
failed uncovered states are defined in Table 3-3. 
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Table 3-3. ADS-B/Surveillance Data Link Operational States 


State 

Definition 

Impact 

Fully operational 

Valid broadcast and reception of 
broadcasts from other aircraft 

Transmit and receive functions are 
fully available 

Degraded 

Unable to receive broadcasts 
from other aircraft and may or 
may not receive alert of capability 
loss; broadcast capability func- 
tioning 

Knowledge of other aircraft is in- 
valid but approach is allowed 

Failed safe 

Invalid broadcast and alert of ca- 
pability loss and, possibly also, 
loss of reception capability of 
broadcasts from other aircraft 

No longer able to perform inde- 
pendent approaches; approach 
aborted 

Failed uncovered 

Invalid broadcast and no alert of 
capability loss 

Other aircraft do not receive valid 
surveillance data 


Collision- Alerting Avionics 

The collision-alerting avionics block diagram and operational states are shown in 
Figure 3-3 and Table 3-4, respectively. The collision alerting avionics is fully op- 
erational if 1 alerting processor and 1 alerting displays are functional. The alerting 
processor receives the position of its own aircraft from the IAPR RNP function 
and the IAPR state variable data from the aircraft approaching on the adjacent 
runway from the ADS-B/surveillance data link system. 

Table 3-4. Collision-Alerting Avionics Operational States 


State 

Definition 

Impact 

Fully operational 

Collision-alerting capability func- 
tioning properly 

Alerting-capability available for 
normal approach 

Failed safe 

Collision-alerting not available 
and alert of capability loss 

No longer able to perform inde- 
pendent approaches; approach 
aborted 

Failed uncovered 

Collision-alerting not available 
and no alert of capability loss 

Unable to detect blunders of other 
aircraft but approach is not aborted 


Guidance and Control and Pilot 

Figure 3-4 shows the block diagram of the Guidance and Control and Pilot sys- 
tems. The pilot and crew are included here as the block denoted “Pilot.” The 
Guidance and Control system simply represents all of the systems of the aircraft 
exclusive of the IAPR RNP, ADS-B/surveillance data link, and Collision- Alerting 
Avionics systems that all impact safety. The pilot provides inputs to engine and 
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flight control to ultimately direct the thrust and flight path of the aircraft. Propul- 
sion is provided via the engines. Engine control is provided by the engine proces- 
sor using input from the pilot and engine sensors. Flight control is through the 
control processor, which moves the control surfaces based on inputs from the pi- 
lots and aircraft state and environment sensors. The alternator and PDUs generate 
and distribute electrical power to all components requiring it. 

Figure 3-4. Guidance and Control and Pilot Systems 


To all components 
requiring electrical power 



Table 3-5 presents the operational states of the guidance and control function. The 
guidance and control system is fully operational if 1 engine sensors, 1 aircraft 
state and environmental sensors, 1 engine processor, 1 control processor, 1 guid- 
ance and control displays, both engines, 1 control surfaces, and 1 alternator and 
PDU are functional. The failed safe state would result from the covered failure of 
1 engine. Any uncovered failures or covered failures that result in the system not 
satisfying the definition of fully operational would place the guidance and control 
function in the failed uncovered state. 

Table 3-5. Guidance and Control Operational States 


State 

Definition 

Impact 

Fully operational 

All other capabilities and support 
subsystems operational 

Capability is fully available for nor- 
mal approach 

Failed safe 

Loss of sufficient capability and 
knowledge of loss 

No longer able to perform inde- 
pendent approaches; approach 
aborted 

Failed-uncovered 

Loss of sufficient capability and 
no knowledge of loss or inability 
to control aircraft 

Worst-case blunder 
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The Modeling Process 


Table 3-6 presents the operational states of the pilot function. The pilot function is 
meant to capture the effect of human error in the safety of an independent ap- 
proach. While an actual model of the reliability of the human in control of the air- 
craft is beyond the level of work being presented here, the pilot function can still 
be broken down into operational states to demonstrate how the reliability of the 
human is integrated into the safety analysis methodology. 

Table 3-6. Pilot Operational States 


State 

Definition 

Impact 

Fully operational 

Pilot functioning nominally with- 
out any faults 

Alerting capability available for 
normal approach 

Recoverable fault 

Pilot fault has occurred; is pos- 
sible to recover from fault 

No impact prior to approach; air- 
craft blunder after start of ap- 
proach 

Nonrecoverable fault 

Pilot fault has occurred; is not 
possible to recover from fault 

No impact prior to approach; air- 
craft blunder after start of ap- 
proach 


Build a Markov Model 

A set of Markov reliability models are constructed from the system described ear- 
lier in the system description section. The Markov models are developed in accor- 
dance with the techniques presented in Chapter 4. A separate model is constructed 
for each function defined in Table 3-1. 

Results and Discussion 

Markov reliability models are used to calculate the probabilities of being in the 
operational states of each of the functions. Table 3-7 presents the baseline failure 
rates and coverage probabilities for each of the components identified in the sys- 
tem description for the IAPR system. The failure rates and coverage probabilities 
constitute nearly all of the input parameters for the models. The only missing in- 
put parameter is recovery rate from an intermittent human failure for the Pilot 
model. The baseline value for this rate is set at 3.6xl0 2 recoveries/hour. 

Table 3-7. Baseline Failure Rates and Coverage Probabilities 


Component 

Failure rate (failures/hour) 

Coverage probability 

IAPR RNP 

GPS 

3.0E-5 

0.99 

INS 

1 .OE-4 

0.99 

Navigation displays 

2.0E-5 

0.999, 0.99 

Navigation processor 

1 .OE-5 

0.99, 0.95 
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Table 3-7. Baseline Failure Rates and Coverage Probabilities (Cont.) 


Component 

Failure rate (failures/hour) 

Coverage probability 

ADS-B/Surveillance Data Link 

AHRS 

1.0E-5 

0.99 

ADS-B displays 

2.0E-5 

0.999, 0.99 

ADS-B processor 

1.0E-5 

0.99, 0.95 

Modulator and transmitter 

5.0E-5 

0.99 

Receiver and demodulator 

5.0E-5 

0.99 

Antenna 

1.0E-6 

1.00 

Collision-alerting logic 

Alerting displays 

2.0E-5 

0.999, 0.99 

Alerting processor 

1.0E-5 

0.99, 0.95 

Guidance and control 

Engine sensors 

4.0E-5 

0.99 

Engine processor 

1.0E-5 

0.99 

Engine 

1.0E-5 

0.999 

Alternator and PDU 

2.0E-5 

0.99 

Guidance and control displays 

2.0E-5 

0.999 

State and environment sensors 

4.0E-5 

0.99 

Control processor 

1.0E-5 

0.99 

Control surfaces 

5.0E-6 

0.99 

Pilot 

Intermittent human failure 

1.0E-4 

1.00 

Permanent human failure 

1.0E-6 

1.00 


Note: For coverage probabilities entered as two numbers, the first number is the coverage 
probability of first failure in redundant components, and the second number is for second failure in 
the redundant components. 


The input parameters used are not from any specific source and are selected with 
the intent of highlighting the fidelity of the Markov reliability models. Typical 
values of failure rates and coverage probabilities are assigned for the components 
that are likely to comprise the system. The failure and recovery rates for the pilot 
model are not based on any empirical data. 

Table 3-8 shows the calculated probabilities for the operational states of each 
function. The Markov models are evaluated using Version 7.9.8 of the SURE Re- 
liability Analysis program developed by NASA Langley Research Center [12]. 
The Markov model state probabilities are calculated for 4 and 10 hours. These 
represent two time intervals from aircraft takeoff to the lineup point for an inde- 
pendent approach. 
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Table 3-8. Probabilities of Operational States 



Probability of being operational state 


At 4 hours 

At 10 hours 

Operational state 

Lower 

bound 

Upper 

bound 

Lower 

bound 

Upper 

bound 

IAPR RNP 

Fully operational 

9.9948E-1 

9.9948E-1 

9.9870E-1 

9.9870E-1 

Degraded 

5.15E-4 

5.15E-4 

1.29E-3 

1 .29E-3 

Failed safe 

5.49E-8 

5.49E-8 

3.43E-7 

3.43E-7 

Failed uncovered 

6.16E-6 

6.16E-6 

1.54E-5 

1 .54E-5 

ADS-B/surveillance data link 

Fully operational 

9.9959E-1 

9.9960E-1 

9.9899E-1 

9.9899E-1 

Degraded 

2.00E-4 

2.00E-4 

5.00E-4 

5.00E-4 

Failed safe 

2.02E-4 

2.02E-4 

5.05E-4 

5.05E-4 

Failed uncovered 

2.96E-6 

2.96E-6 

7.40E-6 

7.41 E-6 

Collision-alerting logic 

Fully operational 

1 .0000E+0 

1.0000E+0 

1.0000E+0 

1.0000E+0 

Failed safe 

7.83E-9 

7.83E-9 

4.90E-8 

4.90E-8 

Failed uncovered 

9.60E-7 

9.60E-7 

2.40E-6 

2.40E-6 

Guidance and control 

Fully operational 

9.9991 E-1 

9.9991 E-1 

9.9977E-1 

9.9978E-1 

Failed safe 

7.99E-5 

8.00E-5 

2.00E-4 

2.00E-4 

Failed uncovered 

1 .03E-5 

1.03E-5 

2.60E-5 

2.61 E-5 

Pilot 

Fully operational 

1 .0000E+0 

1.0000E+0 

9.9999E-1 

9.9999E-1 

Recoverable failure 

2.78E-7 

3.59E-7 

2.78E-7 

7.83E-7 

Nonrecoverable failure 

4.00E-6 

4.00E-6 

1.00E-5 

1 .00E-5 


Note that the results in Table 3-8 are presented as bounds on the probabilities of 
being in the states of each function. The bounds occur from two sources. The first 
source, which affects all of the models, is that the SURE program calculates and 
outputs the bounds of the probability of being in the states of the model 
(numerical approximation error). The second source, which affects just the ADS- 
B/surveillance data link and guidance and control Markov models, is the model 
truncation aggregation technique used to limit the size of these models. Model 
truncation introduces some uncertainty into the predictions [11]. 

The probabilities shown in Table 3-8 are used by the Impact Model discussed in 
Chapter 4. However, there are some system probabilities produced by the Markov 
reliability models that are also of interest. Some component failures occurring be- 
fore the approach lineup can preclude an independent approach. Table 3-9 pres- 
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ents two metrics of interest. The first is the probability that insufficient capability 
is available to attempt an independent approach and the approach is aborted by the 
pilot. This is the probability that one or more of the functions, excluding the Pilot 
function, is in its failed safe operational state. The second metric is the probability 
of a loss of the aircraft before the approach lineup. This is the probability of being 
in the failed uncovered operational state of the guidance and control function. 


Table 3-9. Probabilities of Other Operational States 



Probability 

Metric 

Upper bound at 4 hours 

Upper bound at 10 hours 

Insufficient capability 
independent approach 

2.82E-4 

7.06E-4 

Loss of aircraft before 
approach lineup 

1.03E-5 

2.61 E-5 
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Chapter 4 

Markov Modeling Method 


Background 

Markov modeling techniques provide a systematic means of investigating system 
reliability for large, complex systems. They permit the inclusion of sequence- 
dependent events, such as repairs, in a natural fashion. One of the most powerful 
aspects of Markov models is their ability to permit simplifying approximations to 
be made and to provide means to obtain bounds on these approximations. The ba- 
sic concepts of Markov modeling are introduced via simple, but representative 
examples. These examples clearly point out the general flexibility as well as the 
main drawback of the technique, particularly the rapidly proliferating state space. 

Single-Component System 

Figure 4-1 shows a single-component system. The first step in modeling the reli- 
ability of that system is to determine what is required for the system to be in an 
operational state. That single-component system has a trivial operational require- 
ment: it is operational if the single component. A, has not failed. (Conversely, the 
system is failed if component A has failed). While this step is simple for that sys- 
tem, it is often one of the most complicated steps in modeling a complex system, 
characterized by many operational states and subtle interactions among compo- 
nents. 


Figure 4-1. Single -Component System Block Diagram 


A 


Given the system operational requirements, the next step is to construct Markov 
model states. A state represents a unique configuration of failed and operational 
elements, sometimes distinguished by the sequence of the failures that led to it. 
Figure 4-2 shows the Markov model for the one-element system. In general, a 
model is generated by first creating state 1 , the state where there are no failed 
components in the system. The various transitions out of state 1 represent failures 
of the system components, accounted for individually or in groups. In this case, 
there is only one component; thus, a transition denoted X is created leading to 
state 2. This state represents this system when component A is failed. Noting the 
operational requirements for this system, state 2 is labeled as a system failure. 
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Since there is only one component in the system and its failure has been accounted 
for, the Markov model is complete. 

Figure 4-2. Single-Component System Markov Model 


OF IF 



This system’s reliability is just the probability, as a function of time, of being in 
state 1. Actually, there is a probability associated with each state. For example, at 
time zero the probability of being in state 1 (no failures) is 1 (or 100 percent) and 
the probability of being in state 2, or any other state, is 0. Parameter X on the tran- 
sition in the model not only indicates that component A has failed along this tran- 
sition, but that the component’s failure rate is X failures per hour. Throughout this 
discussion, it will be assumed that all failure rates are constant in time. To obtain 
the system reliability as well as other state probabilities of interest as a function of 
time, the probability “flowing ’’ out of state 1 into state 2 needs to be tracked. 

Probability flow is the product of the transition rate and the state probability for 
the state at the origin of the transition. Thus, a state with zero probability has no 
probability flowing out of it, a state with no exiting transitions has no flow out, 
and a state with probability equal to 1 and an exiting transition rate of X has an 
instantaneous flow out equal to X. The rate of change of each probability is then 
given by the net probability flow into the corresponding state. Therefore, a 
Markov model is thus mathematically described by a set of differential equations 
governing the evolution in time of the probabilities of being in each state. 

Using the definition of the probability flows, the following equations are obtained 
for the Markov model shown in Figure 4-2: 


dP x ( t)/dt = - X P x (t) [Eq. 4-1] 

dP 2 (t)/dt = XP x (t) [Eq. 4-2] 

These equations, representing the rate of changes in each state variable (Pj and 
P 2 ), are called state equations. Equation 4-1 shows that the rate of change in prob- 
ability for state 1 is the exiting transition rate X times the probability of being in 
state 1. The minus sign indicates that the transition is out of the state and, there- 
fore, reduces the probability of being in state 1. Equation 4-2 is interpreted simi- 
larly. Note that the flow is into state 2; the positive term indicates an entering 
transition that increases the probability in state 2. Also, the flow into state 2 is the 
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rate X times the probability of state 1 ; the flow on this transition is due to state 1, 
the origin of the transition. Equations 4-1 and 4-2, along with the initial condition 
of the state probabilities, / > j(0) = 1 and P 2 ( 0) = 0, provide a complete description 
of the system ’s reliability. Markov models have the property that a flow leaving 
one state enters another, as shown in Equations 4-1 and 4-2. Hence, the total sys- 
tem probability does not change as the system evolves. This fundamental property 
is called conservation of probability. The sum of all the system’s state probabili- 
ties is always equal to 1 . 

There are many ways of solving Equations 4-1 and 4-2 in closed form, such as 
standard integration or Laplace transform. Using any convenient technique and 
recalling that the failure rate X is constant, yields the following solution: 

P ] (t) = e-h [Eq. 4-3] 

P 2 (t) = 1 - e-h [Eq. 4-4] 

State 1 starts with a probability of 1 and decays exponentially toward 0, while 
state 2 has a probability initially at 0 that grows toward 1 . Notice that the sum of 
the two state probabilities is 1 at all times, thus indicating the conservation of 
probability. 

Two-Component Parallel System 

Figure 4-3 shows a two-component system in which the components are con- 
nected in parallel. The requirement for system operation is that at least one of the 
two components is working. 

Figure 4-3. Two-Component Parallel System Block Diagram 
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A Markov model of this system is shown in Figure 4-4. Notice that states 4 and 5 
distinguish between the two possible orders of component failure leading to a 
system loss. The nature of the Markov model makes such order dependencies easy 
to include. The state equations for this system are 

dP l (t)/dt = -(X A + X B )P l (t) 
dP 2 (t)/dt = X A P\(t) - X B P 2 (t) 

dP 3 (t)/dt = X B P, (0 - X A p 3 (0 [Eq. 4-5] 

dP A (t)/dt = X B P 2 (t) 
dP 5 (t)/dt = X A p 3 (0 

Figure 4-4. Two-Component Parallel System Markov Model 
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For constant failure rates and an initial condition of 

P(0) = [1 0,0 OOF, 

the solution is 

P\{t) = e-(XA + XB)t 
P 2 (f) = e-XBt - e-(XA + XB)t 

P 3 (t) = e-XAt - e-(XA + XB)t [Eq. 4-6] 

P 4 (t) = [X A + X B (e-(XA + XB)t) - (X A + X B )(e-XBt]/(X A + X B ) 

P5(t) = [X B + X A (e-(XA + XB)t) - (X A + X B )(e-XAt]/(X A + X B ) 

As expected, Pj(/), P 2 (t), and P 3 (0 have final values of 0 and the sum of the final 
values of P 4 O) and P 5 O) is 1 . 
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Two-Component Parallel System with 
Imperfect Coverage 

The architecture of a two-component parallel system (Figure 4-3) will be used 
again. In this example, imperfect coverage is added to the description of the sys- 
tem ’s operation. Coverage is the ability to successfully recover from a component 
failure. Coverage for a specific component is represented as the fraction of that 
component 's failures that can be detected and successfully recovered from. 

If the system starts out using component A and a covered failure occurs, then the 
system reconfigures to rely on component B. Any subsequent failure of B results 
in a system failure. If the system is in component A, and there is a failure in com- 
ponent B, then a system loss will result when A fails, since there are no operating 
components left. If the system is using component A and it suffers an uncovered 
failure, by definition the system does not successfully recover, so a system loss 
results. This situation is shown in the Markov model in Figure 4-5. 

Figure 4-5. Parallel System Markov Model with Imperfect Coverage 


OF IF 2F 



States 4, 5, and 6 represent system failure states, and states 2 and 3 represent sys- 
tem operation states where one component is failed. The coverage parameter is c. 
The transition into state 2 represents the covered failure of component A while the 
transition into state 4 represents the uncovered failure of component A. The state 
equations for this Markov model are 

dP l (t)/dt = -(X A + X B ) P,(0 
dP 2 (t)/dt = cX A P ] (t)-l B P 2 (t) 

dP 3 (t)/dt = X B P] (t) - X A P 3 (0 [Eq. 4-7] 

dP 4 ( t)/dt = (l-c)X A P l (t) 
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dP 5 (t)/dt = X B P 2 (t) 
dP 6 (t)/dt = X A P 3 (t) 


If X^, X B , and c are constant in time and the initial condition is 

P( 0) = [1 0 0, 0 0 0]', 


then the solution is 


P\(t) = e-(XA + to)t 

f*2(0 = c[e-tot - e-(XA + to)t] 

P 3 (t) = e-XA t - e-(XA + to)t [Eq. 4-8] 

P 4 (t) = (1 - c)X A [1 - e-(XA + to)tV(X A + X B ) 

P$(t) = c[X A + Xg(e-(XA + to)t - (X^ + Xg) e-tot]/ (X A + Xg) 

P(,(t) = [Xg + X A (e-( X a + to)t - (X A + Xg) e-XAt]/(X A + Xg) 

Notice that including the effects of imperfect coverage, some of which are se- 
quential in nature, is straightforward in Markov models. The coverage values may 
come from engineering experience or may be the result of Markov models that 
examine the fault detection, identification, and reconfiguration process used to 
obtain coverage. While a system is still in the design stage, coverage values are 
often taken as constants. As the design progresses, it may be possible to refine the 
coverage values through modeling. The issue of coverage modeling is discussed 
further in reference [10]. 

Two-Component Parallel System with Repairs 

Once again, the architecture of a two-component parallel system (Figure 4-3) will 
be used. In this example, repairs are added to the description of the system’s op- 
eration. Hence, there are transitions from this state to states at higher failure levels 
representing failure events, and there are transitions back to lower failure levels 
representing the repair of components. It is assumed that there are two repairmen 
in this system. This enables the repair of two components at their respective repair 
rates at the second failure level. Other repair strategies could also be modeled. 
Notice that repairs are a highly order-dependent phenomenon: a failure must occur 
before a repair can be performed. 



Markov Modeling Method 

This system is shown in the Markov model in Figure 4-6. The failure rates X A and 
X B and the repair rates are p A and p 5 . The state equations Eire formed as before; 
the change in state probability is equal to the flows into and out of that state: 

dP i(t)/dt = -(X A + A, b ) P^t) + p A P 2 (t) + p B P 3 (t) 

dP 2 (t)/ dt = X A Pj(t) - (p A + X B ) P 2 (t) + \i B P 4 (t) + p B P 5 (t) 

dP 3 (t)/dt = X B P, (t) - (X A + p B ) P 3 (t) + p A P 4 (t) + p A P 5 (t) [Eq. 4-9] 

dP 4 (t)/dt = A, B P 2 (t) - (Pa + 1%) P 4 O) 

dP 5 (t)/dt = X A P 3 (t) - (p A + p B ) P 5 (t) 


Figure 4-6. Parallel System Markov Model with Repairs 
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If the initial condition is 


P(0) = [1 0,0 0 0] T 

and if all transition rates are constant in time, then a closed form solution can be 
found. However, the form of this solution is quite complex. For systems with this 
level of complexity, numerical integration solutions Eire easier and less expensive 
to obtain than closed-form analytical solutions. 

A key point to notice is that constant repair rates were used. As is true for constant 
failure rates, these imply that the repair times are exponentially distributed in time 
with a mean time equal to the inverse of the repEiir rate. During a system design, 
this approximation may be adequate. As the design progresses, more information 
may be available to refine the repair distribution. These other distributions will 
result in time-varying repair rates, which the numerical solution of the MEirkov 
model can easily handle. 
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State Space-Reduction Techniques 
for Markov Models 

The Need for State Space Reduction 

When real-world fault-tolerant systems are analyzed, the state space is so large 
that some form of state space reduction is needed to make the analysis tractable. 
Consider a system with 20 components. Furthermore, assume that the order of 
component failure does not impact the system performance. Hence, each state is 
unique in that a specific list of components is failed; the order of these failures is 
not unique. 

At the zero failure level there is one state — no components have failed. At the first 
failure level, there are 20 states representing the single failure of each of the 
20 components. Each state at the first failure level has 19 exit transitions repre- 
senting the failure of the remaining 19 components. Therefore, at the second fail- 
ure level, there are potentially 20x19 states. Removing the pairs that have the 
same 2 components failed, but in different orders, eliminates half of the states. 
Thus, at the second failure level, there are 190 states describing the 190 combina- 
tions of dual failures (20x19/2!). This pattern continues with 1,140 states at the 
third failure level (20x19x18/3!), 4,845 states at the fourth failure level 
(20x19x18x17/4!), out to the 20th failure level where there is one state repre- 
senting all components failed. The total number of states is about 10 6 . 

Although 20 components is not a very large system to analyze, some form of state 
space reduction is needed. Notice that the size of the state space grows exponen- 
tially with the number of components in the system. For example, a three- 
component system has 8 states, a four-component system has 16 states, a five- 
component system has 32 states, etc. In general, a system with n components has 
2 n states if failure order is irrelevant. 

It is clear that this problem of state space size explosion is a serious limitation on 
Markov models. In the following subsections, methods for controlling the size of 
the state space are discussed. One method of reducing the state space, mentioned 
in an earlier section, the detailed processes of fault detection, identification, and 
reconfiguration are not explicitly modeled, but rather, a single parameter, the cov- 
erage value c, is used to capture the performance of the fault-handling process. 
This behavioral decomposition [11] reduces the number of states by capturing the 
consequences of failures, namely the system recovers or does not, and relegates 
the details of this operation to a separate fault-handling model. In the following 
subsections, two other techniques for controlling the state space size are dis- 
cussed. 
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Exact State Aggregation 

In order to control the state space size of a Markov model, exact state aggregation 
is introduced. This technique aggregates two states at a common failure level into 
a single state. If certain conditions are met on the two states’ exit transitions, then 
there are no approximations introduced in the aggregation. 

Figure 4-7 shows a piece of a Markov model. The consequences of aggregating 
states 3 and 4 into one state at the n th failure level are examined. Although there 
may be more transitions (indicated by dashed lines) into and out of states 1, 2, 5, 
and 6 at failure levels (n-1) and (n+1), respectively, they will not impact the ag- 
gregation since they do not interact directly with the states to be aggregated. How- 
ever, all transitions into and out of the two states to be aggregated are depicted in 
this figure. This example easily extends to situations where there are more enter- 
ing and exiting transitions for the states to be aggregated. The state equations for 
the model in Figure 4-7 are shown with the extraneous transitions to and from 
states 1, 2, 5, and 6 omitted 

dP x (t)!dt = - (a + b) P,(r) 

dP 2 (t)/dt = -(c + d) P 2 (t) 

dP 3 (t)/dt = a P x (t) + c P 2 (t) - (w + jc) P 3 (t) [Eq. 4-10] 

dP 4 (t)/dt - b P x (t) + d P 2 (t) - (y + z) P 4 (t) 

dP 5 (t)/dt = w P 3 (t) + y P 4 (t) 
dP 6 (t)/dt = xP 3 (t) + Z P 4 (t) 

Figure 4-7. System Before Aggregation 


(n-l)F nF (n+l)F 



Figure 4-8 shows the same portion of the Markov model but the two states at the 
n* failure level have been aggregated into one state (state a.3). States in the ag- 
gregated model are denoted with an “a” in their designations. The state equations 
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for the aggregated system, once again ignoring the extraneous transitions at the 
(n-1) and (n+1) failure levels are: 


d/> al (/)/d? = -aP al (0 
JP a2 (0/di = -pP a . 2 (0 

dP a3 (t)/dt = «P a] (0 + P P a . 2 (0 - (Y + 8) P a.3(0 [Eq. 4-11] 

dP a . 5 (t)/i dt = yp a . 3 (0 
dP a6 (tydt = sp aJ (t) 

Figure 4-8. System After Aggregation 


(n-DF nF (n+nF 



The goal is to determine what the transition rates in the aggregated model, a, (3, y, 
and 8 are. To do this, the following state probabilities are set equal: 

♦ P 1 (0 = P a .iW 

♦ p 2(') = p a .2« 

♦ P 3 (t) + P 4 (t) = p a3 (0 

♦ p 5(0 = p a .5(0 

♦ p 6W = /> a.6(0- 

These equalities lead directly to the equality of the differentials: 

♦ dP\(t)/dt = dP a] (t)/dt 

♦ dP 2 (t)/dt = dP a2 (t)/dt 

♦ dP 3 (t)/dt + dP 4 (t)/dt = dP a3 (t)/dt 
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♦ dP 5 (t)/dt = dP a5 (t)/dt 

♦ dP 6 (t)/dt = dP a6 (t)/dt. 

In other words, all states have a one-to-one correspondence in the two models ex- 
cept states 3 and 4, which are aggregated into state a.3. 

Applying the equalities for the differentials for states 1, 2, 5, and 6 (Equa- 
tion 4-10) and their corresponding states in the aggregated model (Equation 4-11) 
gives the following: 

♦ (state 1) a = a + b 

♦ (state 2) P = c + d 

♦ (state 5) y = [w P 3 (t) + y P 4 (t)]/[P 3 (t) + P 4 (t)] 

♦ (state 6) 8 = [x P 3 (t) + z P 4 (t)]/[P 3 (t) + P 4 (t )] 

These equations also satisfy the equalities for the aggregated state a.3. The equa- 
tions for a and P make intuitive sense: the transition rate into the aggregated state 
a.3 is the sum of the rates into 3 and 4 since state a.3 is the combination of states 3 
and 4. The transition rates y and 8 are slightly more complex. 

The flows (i.e., differentials) in the unaggregated and aggregated models must be 
the same. Flows are the product of transition rates and the probability of the state 
at the origin of the transition. Hence, the transition rate leaving the aggregated 
state a.3, going to state a.5, must be the sum of the flows leaving states 3 and 4, 
going to state 5, divided by the probability of being in state a.3 (which equals the 
sum of states 3 and 4’s probability). In other words, the exit transition from an 
aggregated state is a “mix” of the unaggregated states’ exit transitions. Notice that 
in general, the aggregated system 's exit rates y and 8 are time varying functions 
even though the unaggregated system’s rates w, x, y, and z may be time invariant. 

To perform an exact state aggregation consider a special case of the above system. 
If w = y and x = z, then 

♦ y= w = y 

♦ 8 = x = z. 

Performing aggregations in this case gives time-invariant transition rates in the 
aggregated model if the unaggregated model also has time invariant rates. This 
special case is considered exact since no approximations are introduced and the 
aggregated transition rates can be found by inspection: rates into the aggregated 
state are the sum of the rates into the unaggregated states, and the exit rates for the 


4-11 



aggregated state are the same as those to each destination as from the unaggre- 
gated states. Notice that a sufficient condition for exact aggregation is that the two 
states to be aggregated must be at the same failure level and have identical transi- 
tion rates with one another to each of their destinations. This aggregation applies 
for any initial conditions on the unaggregated states. 

Two examples are now presented to show the technique of exact state aggrega- 
tion. First, return to the parallel system with imperfect coverage of Figure 4-5. 
States 5 and 6 are at a common failure level and have identical exit transitions 
(they both have no exit transitions). Hence, these two states can be aggregated. 

The resulting model is shown in Figure 4-9. The state equations are 

dP l (t)/dt = -(X A + X B )P l (t) 

dP 2 (t)/dt = cX A P x (t) - X B P 2 (t) 

dP 3 (t)/dt = X B Pj(0 - X A P 3 (t) [Eq. 4-12] 

dP 4 (t)/dt = (1 - c)X A P x {t) 
dP 5 (t)/dt = X B P 2 (t) + X A P 3 (t) 

Figure 4-9. Aggregated Parallel System with Imperfect Coverage 


OF IF 2F 



If X A , Xg and c are constant in time and the initial condition is 

P(0) = [1 0, 0 0 0]', 


then the solution is: 


Pj(0 = e-(kA + XB)t 
P 2 (t ) = c[e-XBt - e-(kA + XB)t ] 
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P3(0 = e-XAt - e-(KA + XB)t [Eq. 4- 1 3] 

P 4(0 = (1 - c)X A [1 - e-(XA + XB)t]/(X A + Xg) 

P 5 U) = [cX A + Xg + (k A + cXg) e-(kA + A B)t - (A^ + Xg) e-XAt - c(X A + X g) e- 
X Bt] 

/(X A + Xg) 

Comparing this solution with that of the unaggregated case (Equation 4-8) shows 
the result to be identical, except that the aggregated state 5 is the sum of the unag- 
gregated states 5 and 6. 

As a second example, consider the parallel system with repairs (Figure 4-6). 

States 4 and 5 have identical exit transition rates to each of their destinations; both 
have rate |i A going to state 3 and rate (l B going to state 2. Hence, states 4 and 5 

can be aggregated. The resulting model is shown in Figure 4-10 and the state 
equations are as follows: 

dP i(t)/dt = - (X A + X B ) P\(t) + \l A P 2 (0 + \i B P 3(0 
dP 2 (t)/dt = X A P,(f) - (\l A + X B ) P 2 {t) + P 4 (t) [Eq. 4-14] 

dP 3 (t)/dt = A B P ](0 - (X A + \i B ) P 3(0 + )i A P 4 (/) 
dP 4 (t)/dt = A B P 2(0 + X A P 3 (0 - (p. A + |i B ) P 4(0 
Figure 4- JO. Aggregated Parallel System Markov Model with Repairs 
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As was the case before the aggregation was performed, these equations are quite 
complex to solve in closed form. A numerical solution shows correspondence 
between the unaggregated and aggregated solutions. 

Notice that the process of state aggregation reduces the level of detail of the 
model. What were two unique states are now mixed together as one. As long as 
the states to be aggregated share common characteristics that are of interest to the 
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modeler, such as common operating modes, this loss of detail is usually not criti- 
cal. 

Model Truncation 

Another method of controlling the size of the state space is model truncation. Fre- 
quently, in highly reliable systems, the probability of system failure caused by 
many components being failed is much less likely than system failures caused by a 
few specific component failures. Hence, only the most probable failure sequences 
impact the solution and only these sequences need to be modeled in detail. Al- 
though the accuracy of the solution is impacted by the unmodeled failure se- 
quences, bounds on the solution’s accuracy will be directly obtained. 

Figure 4-11 shows a Markov model that has been truncated at the no failure level. 
Thus, any configuration with n or more failures is called a system failure and all 
of these states are lumped into the state SF(n). Clearly, this will lead to an overes- 
timate of the system failure probability for the n* and greater failure levels since 
some of these configurations are operational. Therefore, the state SF(n) will be 
used in establishing an upper bound on the system failure probability. 

Figure 4-11. Truncated Markov Model 


fn-2)F ( n-HF nF 



States SF(n-l) and SF(n-2) are the contributions to the system failure probability 
at the (n-1) and (n-2) failure levels, respectively. The “e” notation on the transi- 
tions leading to states at the (n- 1 ) failure level indicate that these should be exact 
transitions. This results in an exact flow into states at the (n-1) failure level. Thus, 
the probability of the system failure state SF(n-l) is exact. The “a” transition en- 
tering the failure state at the no failure level may be an approximation that is 
greater than, or equal to, the actual transition rate. This results in a larger prob- 
ability flowing into state SF(n) than would be seen in an exact model. This, in ad- 
dition to the assumption that all n and greater failure configurations cause a 
system failure, ensures that state SF(n) contains a higher probability than exists in 
an exact model. 
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The sum of all failure state probabilities up to and including state SF(n-l) pro- 
vides a lower bound on the system failure probability. This is valid since these 
states have probabilities equal to the exact case, and the failure contributions at 
failure levels n and greater have not been included. The upper bound is obtained 
by adding system failure state SF(n) to the lower bound probability. This provides 
an upper bound since flows into state SF(n) are equal to, or greater than, those in 
an exact model and it is an overestimate of the true system failure probability at 
the n and greater failure levels. 

The tightness of these bounds is measured by their difference, SF(n). If this ap- 
proximation is too large for the given problem, then the model must be con- 
structed out to the next failure level. For most highly reliable, integrated control 
systems, three to five failure levels provides sufficient accuracy. Given the exam- 
ple of a 20-component system needing 10 6 states to model all 20 failure levels, 
truncating the model at the third failure level requires 212 states. Hence, there is a 
substantial savings in using model truncation. 

Returning to the truncated model in Figure 4-10, a further reduction in the size of 
the state space can be made. The transition rates from the operational states at the 
(n-1) failure level to the SF(n) state are allowed to be approximations. Therefore, 
it is convenient to set all of the “a” rates equal. The rates “a” must be equal to, or 
greater than, the exact rates. For example, they may be set to the maximum of the 
exact rates or to the sum of all the component rates. The latter case may be inter- 
preted as assuming that any component can fail at the (n-1) failure level, even 
components that have previously failed. This ensures that the rates “a” are identi- 
cal and are greater than the exact exit transition rates. Since the operational states 
at the (n-1) failure level all have the same exit transition rates and they lead to the 
same destination, they can be exactly aggregated into states with common opera- 
tional modes. This aggregation is shown in Figure 4-12. 

Figure 4-12. Truncated Markov Model with Aggregation 
(n-2)F (n-l)F nF 
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The ability to aggregate states at the (n-1) failure level provides a further reduction 
in the size of the state space. For example, the 20-component system that requires 
10 6 states without truncation, and 212 states when truncated at the third failure 
level, takes only 24 states when the second-failure-level states are aggregated into 
a single-system-failure state and a single operational state. 

Numerical Solutions of Markov Models 


As has been seen, moderate-size, fault-tolerant systems (for example, with 
20 components) generate reliability models with approximately 20 to 100 states 
depending on how the model is truncated and aggregated, how many operating 
modes are present, and whether or not the components have imperfect coverage. 
Given models of this size or larger, it is impractical to obtain closed form solu- 
tions. The most efficient means of obtaining results is a numerical solution of the 
Markov model. 


As has been seen in previous sections, the Markov model represents a system of 
differential equations. For example, the parallel system in Figure 4-3 has the state 


equations 

♦ 

dP i (t)/dt = ((X A +X B )P l (t) 

♦ 

dP 2 (t)/dt = X A P x {t) -X B P 2 (t) 

♦ 

dP 3 (t)/dt = X B P l (t) -X A P 3 (t) 

♦ 

dP 4 (t)/dt = X B P 2 (t) 

♦ 

dP 5 (t)/dt = X A P 3 (t) 


These can be written in matrix form: 


dPfi) = 

dt 


RX a + ^b) 






0 

0 


0 

-X 

0 


B 


/Vfi 

0 



0 

0 

0 

0 

0 


ol 

ol 

o ! p(t> 


where the state vector is 

p( t ) = [Pj(o, p 2 (o> pw, p 4 (t), p 5 m 
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Notice that the columns add to zero. This represents a conservation property of the 
system: all flows leaving a state must enter another state, and transitions do not 
store any flow. The matrix equation may be written more concisely as 

dP(t)/dt = A P(t) [Eq. 4-15] 

Equation 4-15 is the continuous-time representation of the Markov model. Matrix 
A is the continuous-time transition Matrix. While there are many ways of numeri- 
cally integrating this equation, one that is particularly fast and accurate will be 
shown [12]. To begin, the continuous differential is approximated with a discrete 
time step: 


[P(*+A/) - P(t)]/At = A P{t) 

P(t+At) = [I+A At] P{t) 

Pit+At) = M Pit) [Eq. 4-16] 

Matrix I is the identity matrix and M is the discrete-time transition matrix. The 
use of the above approximation is called Euler integration. 

Equation 4-16 represents an iterative solution for the Markov model. Given the 
system’s initial condition, P(0), it is possible to use Equation 4-16 to propagate 
the state probability in time: 


P(At) = M P(0) 
P(2At) = MP(At) 
P(3At) = M P(2At) 
P(4At) = M P(3At) 
P(5At) = MP(4At) 


P(nAt) = M P((n-l)At) [Eq. 4-17] 

While this method is sufficiently stable for appropriate Ats, it can be rewritten in a 
form that is faster to solve. If the state transition matrix M is constant in time (i.e., 
failure rates are constant in time) then the following observation can be made: 

♦ PiAt) = M P(0) 

♦ P(2At) = M PiAt) = M M P(0) = M 2 P(0) 

♦ P(3At) = M P(2At) = M M M P(0) = M^ P(0) 

♦ etc. 
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In general, the state probability at time t = m At is 

P(mAt) = M m P(0) [Eq. 4-18] 

It is efficient to represent m as a binary number. The digits of this number repre- 
sent a recipe for the powers of 2 required to obtain M m . For example, if m = 13, 
its binary representation is 1101. Therefore, M 13 can be found by performing two 
steps. First, calculate M 2 , M 4 , and M 8 by repeated squaring of matrix M. Second, 
use the binary representation of 13 to determine which powers of M need to be 
used. In this case, it is the 8 th , 4 th , and 1 st powers that are needed. Performing 
these indicated multiplications gives the result desired: 

M 13 = M 8 M 4 M 

To get a feel for the increase in speed obtained by this method of matrix doubling 
(Equation 4-18) over that of traditional stepping in time (Equation 4-17), consider 
a problem where the state vector dimension is n and the number of time steps re- 
quired is m. For a matrix of dimension n by m, the process of squaring the matrix 
takes on the order of n 3 operations. The process of multiplying a matrix by a vec- 
tor takes on the order of n 2 operations. Hence, traditional stepping in time requires 
(m n 2 ) operations to achieve a result. If m is an integer power of 2, then z matrix 
doubles are required to obtain M m , where z = integer[ln(m)/ln(2)]. Usually, m is 
not an integer power of 2 so powers of M must be multiplied to obtain M m . On 
average, this means that 1 .5z matrix multipliers are needed. Therefore, the matrix 
doubling method requires (1.5 z n 3 ) operations to get a solution. 

Consider an example where the state vector dimension is n = 50. For a mission 
time of t = 1 ,000 hours and a time step of At = 1 minute and m = 60,000 time 
steps. Thus, z = 15. The number of operations for stepping in time (Equa- 
tion 4-17) is 1.5xl0 8 and the number of operations (on average) for matrix dou- 
bling (Equation 4-18) is 2.8xl0 6 . For this case, stepping in time requires more 
than 50 times the number of operations required for matrix doubling. Therefore, 
when implemented on a computer, matrix doubling would be expected to be 
50 times faster than stepping in time. A further improvement in speed can be ob- 
tained by selecting a time step that gives an m that is an integer power of 2. For 
example, At = 0.915 minutes gives m = 2^. Therefore, z = 16, and only 16 matrix 
multiplications are needed to obtain a result. The total operations for matrix dou- 
bling in this case would be 2xl0 6 . This would reduce computation time further by 
a factor of 1.4. 

One final note on integrating the state equations of a Markov model is the need 
for double-precision calculations. Notice that on the diagonals of the discrete-time 
transition matrix are terms like 


1 -XAt 
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where X is a failure rate. Typical values for X and At are X = 10' 5 hr 1 and 
At = 10'2 hr. Thus, the diagonal entry is 


1 - 10- 7 


Using a machine with single precision limits the accuracy of a number to ap- 
proximately seven significant digits. Hence, the diagonal entry would be stored as 
a 1. Squaring the matrix would maintain this 1 on the diagonal for all time. This 
means that the state associated with this diagonal would never have flows leaving 
it, even though flows would be appearing in states “downstream ” of it. This vio- 
lates the conservation of probability required for Markov models. Although the 
solution can be performed in single precision, some subtle renormalization proce- 
dures are required to maintain the conservation of probability. Performing the in- 
tegration in double precision avoids this problem. 
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